Transformers have attained superior performance in natural language processing and computer vision. Their self-attention and feedforward layers are overparameterized, limiting inference speed and energy efficiency. Tensor decomposition is a promising technique to reduce parameter redundancy by leveraging tensor algebraic properties to express the parameters in a factorized form. Prior efforts used manual or heuristic factorization settings without hardware-aware customization, resulting in poor hardware efficiencies and large performance degradation. In this work, we propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions and automates the choice of tensorization shape and decomposition rank with hardware-aware co-optimization. We jointly investigate tensor contraction path optimizations and a fused Einsum mapping strategy to bridge the gap between theoretical benefits and real hardware efficiency improvement. Our two-stage knowledge distillation flow resolves the trainability bottleneck and thus significantly boosts the final accuracy of factorized Transformers. Overall, we experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss and achieve a better efficiency-accuracy Pareto frontier than hand-tuned and heuristic baselines.
translated by 谷歌翻译
数据增强是一种提高深神经网络(DNN)的鲁棒性的简单而有效的方法。多样性和硬度是数据增强的两个互补维度,以实现稳健性。例如,Augmix探讨了各种增强套的随机组成,以增强更广泛的覆盖,而对抗性培训产生过态度硬质样品以发现弱点。通过此激励,我们提出了一个数据增强框架,被称为奥古曼克,统一多样性和硬度的两个方面。 Augmax首先将多个增强运算符进行随机样本,然后学习所选操作员的对抗性混合物。作为更强大的数据增强形式,奥格梅纳队导致了一个明显的增强输入分布,使模型培训更具挑战性。为了解决这个问题,我们进一步设计了一个解散的归一化模块,称为Dubin(双批次和实例规范化),其解除了奥古曼克斯出现的实例 - 明智的特征异质性。实验表明,Augmax-Dubin将显着改善分配的鲁棒性,优于现有技术,在CiFar10-C,CiFar100-C,微小Imagenet-C和Imagenet-C上以3.03%,3.49%,1.82%和0.71%。可提供代码和预磨料模型:https://github.com/vita-group/augmax。
translated by 谷歌翻译
深度神经网络一直是分类任务成功的推动力,例如对象和音频识别。许多最近提出的架构似乎已经取得了令人印象深刻的结果和概括,其中大多数似乎是断开连接的。在这项工作中,我们在统一框架下对深层分类器进行了研究。特别是,我们以输入的不同程度多项式的形式表达最新的结构(例如残留和非本地网络)。我们的框架提供了有关每个模型的电感偏差的见解,并可以在其多项式性质上进行自然扩展。根据标准图像和音频分类基准评估所提出模型的功效。提出的模型的表达性既是在增加模型性能和模型压缩方面都突出的。最后,在存在有限的数据和长尾数据分布的情况下,此分类法所允许的扩展显示。我们希望这种分类法可以在现有特定领域的架构之间提供联系。源代码可在\ url {https://github.com/grigorisg9gr/polynomials-for-aigmenting-nns}中获得。
translated by 谷歌翻译
Generalisation to unseen contexts remains a challenge for embodied navigation agents. In the context of semantic audio-visual navigation (SAVi) tasks, the notion of generalisation should include both generalising to unseen indoor visual scenes as well as generalising to unheard sounding objects. However, previous SAVi task definitions do not include evaluation conditions on truly novel sounding objects, resorting instead to evaluating agents on unheard sound clips of known objects; meanwhile, previous SAVi methods do not include explicit mechanisms for incorporating domain knowledge about object and region semantics. These weaknesses limit the development and assessment of models' abilities to generalise their learned experience. In this work, we introduce the use of knowledge-driven scene priors in the semantic audio-visual embodied navigation task: we combine semantic information from our novel knowledge graph that encodes object-region relations, spatial knowledge from dual Graph Encoder Networks, and background knowledge from a series of pre-training tasks -- all within a reinforcement learning framework for audio-visual navigation. We also define a new audio-visual navigation sub-task, where agents are evaluated on novel sounding objects, as opposed to unheard clips of known objects. We show improvements over strong baselines in generalisation to unseen regions and novel sounding objects, within the Habitat-Matterport3D simulation environment, under the SoundSpaces task.
translated by 谷歌翻译
This paper analyzes $\ell_1$ regularized linear regression under the challenging scenario of having only adversarially corrupted data for training. We use the primal-dual witness paradigm to provide provable performance guarantees for the support of the estimated regression parameter vector to match the actual parameter. Our theoretical analysis shows the counter-intuitive result that an adversary can influence sample complexity by corrupting the irrelevant features, i.e., those corresponding to zero coefficients of the regression parameter vector, which, consequently, do not affect the dependent variable. As any adversarially robust algorithm has its limitations, our theoretical analysis identifies the regimes under which the learning algorithm and adversary can dominate over each other. It helps us to analyze these fundamental limits and address critical scientific questions of which parameters (like mutual incoherence, the maximum and minimum eigenvalue of the covariance matrix, and the budget of adversarial perturbation) play a role in the high or low probability of success of the LASSO algorithm. Also, the derived sample complexity is logarithmic with respect to the size of the regression parameter vector, and our theoretical claims are validated by empirical analysis on synthetic and real-world datasets.
translated by 谷歌翻译
Bayesian methods, distributionally robust optimization methods, and regularization methods are three pillars of trustworthy machine learning hedging against distributional uncertainty, e.g., the uncertainty of an empirical distribution compared to the true underlying distribution. This paper investigates the connections among the three frameworks and, in particular, explores why these frameworks tend to have smaller generalization errors. Specifically, first, we suggest a quantitative definition for "distributional robustness", propose the concept of "robustness measure", and formalize several philosophical concepts in distributionally robust optimization. Second, we show that Bayesian methods are distributionally robust in the probably approximately correct (PAC) sense; In addition, by constructing a Dirichlet-process-like prior in Bayesian nonparametrics, it can be proven that any regularized empirical risk minimization method is equivalent to a Bayesian method. Third, we show that generalization errors of machine learning models can be characterized using the distributional uncertainty of the nominal distribution and the robustness measures of these machine learning models, which is a new perspective to bound generalization errors, and therefore, explain the reason why distributionally robust machine learning models, Bayesian models, and regularization models tend to have smaller generalization errors.
translated by 谷歌翻译
The feasibility of collecting a large amount of expert demonstrations has inspired growing research interests in learning-to-drive settings, where models learn by imitating the driving behaviour from experts. However, exclusively relying on imitation can limit agents' generalisability to novel scenarios that are outside the support of the training data. In this paper, we address this challenge by factorising the driving task, based on the intuition that modular architectures are more generalisable and more robust to changes in the environment compared to monolithic, end-to-end frameworks. Specifically, we draw inspiration from the trajectory forecasting community and reformulate the learning-to-drive task as obstacle-aware perception and grounding, distribution-aware goal prediction, and model-based planning. Firstly, we train the obstacle-aware perception module to extract salient representation of the visual context. Then, we learn a multi-modal goal distribution by performing conditional density-estimation using normalising flow. Finally, we ground candidate trajectory predictions road geometry, and plan the actions based on on vehicle dynamics. Under the CARLA simulator, we report state-of-the-art results on the CARNOVEL benchmark.
translated by 谷歌翻译
Echo State Networks (ESN) are a type of Recurrent Neural Networks that yields promising results in representing time series and nonlinear dynamic systems. Although they are equipped with a very efficient training procedure, Reservoir Computing strategies, such as the ESN, require the use of high order networks, i.e. large number of layers, resulting in number of states that is magnitudes higher than the number of model inputs and outputs. This not only makes the computation of a time step more costly, but also may pose robustness issues when applying ESNs to problems such as Model Predictive Control (MPC) and other optimal control problems. One such way to circumvent this is through Model Order Reduction strategies such as the Proper Orthogonal Decomposition (POD) and its variants (POD-DEIM), whereby we find an equivalent lower order representation to an already trained high dimension ESN. The objective of this work is to investigate and analyze the performance of POD methods in Echo State Networks, evaluating their effectiveness. To this end, we evaluate the Memory Capacity (MC) of the POD-reduced network in comparison to the original (full order) ENS. We also perform experiments on two different numerical case studies: a NARMA10 difference equation and an oil platform containing two wells and one riser. The results show that there is little loss of performance comparing the original ESN to a POD-reduced counterpart, and also that the performance of a POD-reduced ESN tend to be superior to a normal ESN of the same size. Also we attain speedups of around $80\%$ in comparison to the original ESN.
translated by 谷歌翻译
Reinforcement learning (RL) is a promising solution for autonomous vehicles to deal with complex and uncertain traffic environments. The RL training process is however expensive, unsafe, and time consuming. Algorithms are often developed first in simulation and then transferred to the real world, leading to a common sim2real challenge that performance decreases when the domain changes. In this paper, we propose a transfer learning process to minimize the gap by exploiting digital twin technology, relying on a systematic and simultaneous combination of virtual and real world data coming from vehicle dynamics and traffic scenarios. The model and testing environment are evolved from model, hardware to vehicle in the loop and proving ground testing stages, similar to standard development cycle in automotive industry. In particular, we also integrate other transfer learning techniques such as domain randomization and adaptation in each stage. The simulation and real data are gradually incorporated to accelerate and make the transfer learning process more robust. The proposed RL methodology is applied to develop a path following steering controller for an autonomous electric vehicle. After learning and deploying the real-time RL control policy on the vehicle, we obtained satisfactory and safe control performance already from the first deployment, demonstrating the advantages of the proposed digital twin based learning process.
translated by 谷歌翻译
将规则无缝整合到学习中(LFD)策略是启用AI代理的现实部署的关键要求。最近,信号时间逻辑(STL)已被证明是将规则作为时空约束的有效语言。这项工作使用蒙特卡洛树搜索(MCT)作为将STL规范集成到香草LFD策略中以提高约束满意度的一种手段。我们建议以STL鲁棒性值来增强MCT启发式,以使树的搜索偏向具有更高限制满意度的分支。虽然无域的方法可以应用于将STL规则在线整合到任何预训练的LFD算法中,但我们选择目标条件的生成对抗性模仿学习作为离线LFD策略。我们将提出的方法应用于规划轨迹的领域,用于在非较低机场周围的通用航空飞机。使用对现实世界数据进行训练的模拟器的结果显示了60%的性能比不使用STL启发式方法的基线LFD方法提高了性能。
translated by 谷歌翻译